Introduction

This IPython notebook illustrates how to perform matching using the rule-based matcher.

First, we need to import the py_entitymatching package and other libraries as follows:


In [2]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the (sample) input tables for matching purposes.


In [3]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

# Build the paths to the two input tables and the labeled data
path_A = os.path.join(datasets_dir, 'dblp_demo.csv')
path_B = os.path.join(datasets_dir, 'acm_demo.csv')
path_labeled_data = os.path.join(datasets_dir, 'labeled_data_demo.csv')

In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')

# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()


Out[5]:
 | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label
0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1
1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1
2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0
3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0
4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1

Then, split the labeled data into a development set (I) and an evaluation set (J). The development set is typically used to choose and tune a matcher (here, the rules), and the evaluation set is held out to assess it.


In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
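
To make the role of `train_proportion` and `random_state` concrete, here is a minimal pure-Python sketch of a seeded 50/50 split. It is an illustration only, not the package's implementation; the helper name `split_half` and the integer stand-ins for tuple pairs are assumptions made for this example.

```python
import random

def split_half(rows, train_proportion=0.5, seed=0):
    """Shuffle a copy of rows with a fixed seed and cut it into two parts.

    Hypothetical helper: it only mirrors the idea of a reproducible split.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order every run
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

# Integers stand in for the 450 labeled tuple pairs in S.
pair_ids = list(range(450))
parts = split_half(pair_ids)
print(len(parts['train']), len(parts['test']))  # prints: 225 225
```

Fixing the seed means the same pairs land in each part on every run, which keeps later comparisons between rule sets reproducible.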

Creating and Using a Rule-Based Matcher

This typically involves the following steps:

  1. Creating the rule-based matcher
  2. Creating features
  3. Adding rules
  4. Using the matcher to predict results

Creating the Rule-Based Matcher


In [7]:
brm = em.BooleanRuleMatcher()

Creating Features

Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.


In [8]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

Let's list the names of the generated features. There are 20 of them, and the rules added later refer to these features by name.


In [9]:
F.feature_name


Out[9]:
0                          id_id_lev_dist
1                           id_id_lev_sim
2                               id_id_jar
3                               id_id_jwn
4                               id_id_exm
5                   id_id_jac_qgm_3_qgm_3
6             title_title_jac_qgm_3_qgm_3
7         title_title_cos_dlm_dc0_dlm_dc0
8                         title_title_mel
9                    title_title_lev_dist
10                    title_title_lev_sim
11        authors_authors_jac_qgm_3_qgm_3
12    authors_authors_cos_dlm_dc0_dlm_dc0
13                    authors_authors_mel
14               authors_authors_lev_dist
15                authors_authors_lev_sim
16                          year_year_exm
17                          year_year_anm
18                     year_year_lev_dist
19                      year_year_lev_sim
Name: feature_name, dtype: object
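
The feature names appear to follow the pattern `<ltable attribute>_<rtable attribute>_<measure>`, where `lev_dist` and `lev_sim` stand for Levenshtein distance and similarity and `exm` stands for exact match. The sketch below shows what such measures compute, using plain Python written for this guide rather than the package's own functions. The normalization used for the similarity (one minus distance over the longer length) is an assumption, and the function names are hypothetical.

```python
def levenshtein_distance(s, t):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def levenshtein_similarity(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein_distance(s, t) / max(len(s), len(t))

def exact_match(a, b):
    """Return 1 when the two values are equal, 0 otherwise."""
    return 1 if a == b else 0

left_title = 'Dynamic Information Visualization'
right_title = 'Dynamic information visualization'
# The titles differ only in the case of two letters, so the distance is 2.
print(levenshtein_distance(left_title, right_title))
print(levenshtein_similarity(left_title, right_title))
print(exact_match(1996, 1996))
```

A threshold such as `> 0.4` on a similarity of this kind tolerates small differences like capitalization while still rejecting unrelated titles.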

Adding Rules

Before we can use the rule-based matcher, we need to create rules to evaluate tuple pairs. Each rule is a list of strings, and each string specifies one predicate; the rule is the conjunction of its predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value that is then compared against the value.
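
The three-part structure of a predicate can be illustrated with a small parser. This is a sketch written for this guide, not the way the package parses rules; the regular expression, the `parse_predicate` and `evaluate_predicate` helpers, and the dictionary of precomputed feature values are all assumptions.

```python
import operator
import re

# Comparison operators a predicate may use, mapped to Python functions.
OPS = {'>=': operator.ge, '<=': operator.le, '==': operator.eq,
       '!=': operator.ne, '>': operator.gt, '<': operator.lt}

# <feature name>(ltuple, rtuple) <operator> <numeric value>
PREDICATE = re.compile(
    r'^\s*(\w+)\(ltuple,\s*rtuple\)\s*(>=|<=|==|!=|>|<)\s*([-+]?\d*\.?\d+)\s*$')

def parse_predicate(text):
    """Split a predicate string into (feature name, operator, value)."""
    match = PREDICATE.match(text)
    if match is None:
        raise ValueError('not a valid predicate: %r' % text)
    name, op, value = match.groups()
    return name, op, float(value)

def evaluate_predicate(text, feature_values):
    """Compare a precomputed feature value for one tuple pair to the threshold."""
    name, op, value = parse_predicate(text)
    return OPS[op](feature_values[name], value)

print(parse_predicate('title_title_lev_sim(ltuple, rtuple) > 0.4'))
# prints: ('title_title_lev_sim', '>', 0.4)

# Hypothetical feature values for a single tuple pair.
values = {'title_title_lev_sim': 0.94, 'year_year_exm': 1}
print(evaluate_predicate('year_year_exm(ltuple, rtuple) == 1', values))
# prints: True
```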


In [10]:
# Add two rules to the rule-based matcher

# The first rule has two predicates, one comparing the titles and the other looking for an exact match of the years
brm.add_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], F)
# This second rule compares the authors
brm.add_rule(['authors_authors_lev_sim(ltuple, rtuple) > 0.4'], F)
brm.get_rule_names()


Out[10]:
['_rule_0', '_rule_1']

In [11]:
# Rules can also be deleted from the rule-based matcher
brm.delete_rule('_rule_1')


Out[11]:
True

Using the Matcher to Predict Results

Now that our rule-based matcher has some rules, we can use it to predict whether a tuple pair is actually a match. Each rule is a conjunction of predicates and returns True only if all of its predicates return True. The matcher is a disjunction of rules: if any one of the rules returns True, the tuple pair is predicted to be a match.
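
The "all predicates within a rule, any rule within the matcher" logic can be sketched in a few lines. The code below is a conceptual model only, not the package's implementation: predicates are written as plain Python functions over a dictionary of feature values, and both the helper names and the feature values are hypothetical.

```python
def rule_fires(rule, features):
    """A rule is a conjunction: every predicate must hold."""
    return all(predicate(features) for predicate in rule)

def matcher_predicts(rules, features):
    """The matcher is a disjunction: one firing rule is enough for a match."""
    return int(any(rule_fires(rule, features) for rule in rules))

# The two rules from this guide, expressed as lists of predicate functions.
rule_0 = [lambda f: f['title_title_lev_sim'] > 0.4,
          lambda f: f['year_year_exm'] == 1]
rule_1 = [lambda f: f['authors_authors_lev_sim'] > 0.4]

# Hypothetical feature values for one tuple pair: titles differ, authors agree.
pair = {'title_title_lev_sim': 0.1,
        'year_year_exm': 1,
        'authors_authors_lev_sim': 0.8}

print(matcher_predicts([rule_0], pair))          # prints: 0
print(matcher_predicts([rule_0, rule_1], pair))  # prints: 1
```

With only `rule_0` present, as is the case after `_rule_1` was deleted, a pair whose titles disagree is rejected even when the authors agree; adding a rule can only turn more pairs into matches, never fewer.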


In [12]:
brm.predict(S, target_attr='pred_label', append=True)
S


Out[12]:
_id ltable_id rtable_id ltable_title ltable_authors ltable_year rtable_title rtable_authors rtable_year label pred_label
0 0 l1223 r498 Dynamic Information Visualization Yannis E. Ioannidis 1996 Dynamic information visualization Yannis E. Ioannidis 1996 1 1
1 1 l1563 r1285 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 Dynamic Load Balancing in Hierarchical Parallel Database Systems Luc Bouganim, Daniela Florescu, Patrick Valduriez 1996 1 1
2 2 l1514 r1348 Query Processing and Optimization in Oracle Rdb Gennady Antoshenkov, Mohamed Ziauddin 1996 prospector: a content-based multimedia server for massively parallel architectures S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader 1996 0 0
3 3 l206 r1641 An Asymptotically Optimal Multiversion B-Tree Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger 1996 A complete temporal relational algebra Debabrata Dey, Terence M. Barron, Veda C. Storey 1996 0 0
4 4 l1589 r495 Evaluating Probabilistic Queries over Imprecise Data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 Evaluating probabilistic queries over imprecise data Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar 2003 1 1
5 5 l43 r1415 Optimization of Run-time Management of Data Intensive Web-sites Khaled Yagoub, Dan Suciu, Alon Y. Levy, Daniela Florescu 1999 On random sampling over joins Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya 1999 0 0
6 6 l1466 r1348 Access Path Support for Referential Integrity in SQL2 Joachim Reinert, Theo Hrder 1996 prospector: a content-based multimedia server for massively parallel architectures S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader 1996 0 0
7 7 l1535 r1800 Mariposa: A Wide-Area Distributed Database System Carl Staelin, Paul M. Aoki, Witold Litwin, Michael Stonebraker, Adam Sah, Jeff Sidell, Andrew Yu... 1996 Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases Sin Yeung Lee, Tok Wang Ling 1996 0 0
8 8 l1317 r1676 QuickStore: A High Performance Mapped Object Store David J. DeWitt, Seth J. White 1994 An Overview of Repository Technology Philip A. Bernstein, Umeshwar Dayal 1994 0 0
9 9 l621 r175 Communication Efficient Distributed Mining of Association Rules Ran Wolff, Assaf Schuster 2001 Editorial Richard Snodgrass 2001 0 0
10 10 l668 r1694 Indexing Multimedia Databases (Tutorial) Christos Faloutsos 1995 Information finding in a digital library: the Stanford perspective Tak W. Yan, Héctor García-Molina 1995 0 0
11 11 l1189 r1674 Weimin Du, Xiangning Liu, Abdelsalam Helal Multiview Access Protocols for Large-Scale Replication 1998 Multiview access protocols for large-scale replication Xiangning Liu, Abdelsalam Helal, Weimin Du 1998 1 0
12 12 l1657 r110 Semantic B2B Integration Christoph Bussler 2001 Monitoring business processes through event correlation based on dependency model Asaf Adii, David Botzer, Opher Etzion, Tali Yatzkar-Haham 2001 0 0
13 13 l1490 r599 Extracting Large Data Sets using DB2 Parallel Edition Sriram Padmanabhan 1996 Extracting Large Data Sets using DB2 Parallel Edition Sriram Padmanabhan 1996 1 1
14 14 l595 r87 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? (Panel) Kyuseok Shim, Rajeev Rastogi, Minos N. Garofalakis, Sridhar Ramaswamy 1999 Of crawlers, portals, mice, and men: is there more to mining the Web? Minos N. Garofalakis, Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim 1999 1 1
15 15 l380 r1337 Outerjoin Simplification and Reordering for Query Optimization Csar A. Galindo-Legaria, Arnon Rosenthal 1997 Outerjoin simplification and reordering for query optimization César Galindo-Legaria, Arnon Rosenthal 1997 1 1
16 16 l165 r1118 Cache-and-Query for Wide Area Sensor Databases Phillip B. Gibbons, Srinivasan Seshan, Suman Kumar Nath, Amol Deshpande 2003 Cache-and-query for wide area sensor databases Amol Deshpande, Suman Nath, Phillip B. Gibbons, Srinivasan Seshan 2003 1 1
17 17 l796 r588 Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl Alexandros Labrinidis, Nick Roussopoulos 2000 Novel Approaches in Query Processing for Moving Object Trajectories Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis 2000 0 0
18 18 l1160 r1733 Khaled Alsabti, Vineet Singh, Sanjay Ranka A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data 1997 A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data Khaled Alsabti, Sanjay Ranka, Vineet Singh 1997 1 0
19 19 l1752 r3 SHORE: Combining the Best Features of OODBMS and File Systems Shore Team 1995 The LyriC language: querying constraint objects Alexander Brodsky, Yoram Kornatzky 1995 0 0
20 20 l1647 r945 Cost Based Query Scrambling for Initial Delays Tolga Urhan, Michael J. Franklin, Laurent Amsaleg 1998 The Cubetree Storage Organization Nick Roussopoulos, Yannis Kotidis 1998 0 0
21 21 l1135 r1127 Sampling-Based Estimation of the Number of Distinct Values of an Attribute Peter J. Haas, Lynne Stokes, S. Seshadri, Jeffrey F. Naughton 1995 View maintenance in a warehousing environment Yue Zhuge, Héctor García-Molina, Joachim Hammer, Jennifer Widom 1995 0 0
22 22 l1776 r987 Walking Through a Very Large Virtual Environment in Real-time Yixin Ruan, Kian-Lee Tan, Jason Chionh, Lidan Shou, Zhiyong Huang 2001 Walking Through a Very Large Virtual Environment in Real-time Lidan Shou, Jason Chionh, Zhiyong Huang, Yixin Ruan, Kian-Lee Tan 2001 1 1
23 23 l676 r1395 Datawarehousing Has More Colours Than Just Black & White Thomas Zurek, Markus Sinnwell 1999 Datawarehousing Has More Colours Than Just Black &; White Thomas Zurek, Markus Sinnwell 1999 1 1
24 24 l1087 r648 The Grid: An Application of the Semantic Web Carole A. Goble, David De Roure 2002 An XML query engine for network-bound data Zachary G. Ives, A. Y. Halevy, D. S. Weld 2002 0 0
25 25 l629 r1478 Engineering Federated Information Systems: Report of EFIS '99 Workshop Flix Saltor, Uwe Hohenstein, Ralf-Detlef Kutsche, Wilhelm Hasselbring, Gunter Saake, Stefan Conr... 1999 Engineering federated information systems: report of EEFIS '99 workshop S. Conrad, W. Hasselbring, U. Hohenstein, R.-D. Kutsche, M. Roantree, G. Saake, F. Saltor 1999 1 1
26 26 l649 r1366 Random Sampling for Histogram Construction: How much is enough? Vivek R. Narasayya, Rajeev Motwani, Surajit Chaudhuri 1998 Random sampling for histogram construction: how much is enough? Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya 1998 1 1
27 27 l211 r1490 BeSS: Storage Support for Interactive Visualization Systems William O'Connell, Thomas A. Funkhouser, Alexandros Biliris, Euthimios Panagos 1996 BeSS: storage support for interactive visualization systems A. Biliris, T. A. Funkhouser, W. O'Connell, E. Panagos 1996 1 1
28 28 l734 r384 Min-Max Compression Methods for Medical Image Databases John M. Tyler, Kosmas Karadimitriou 1997 Min-max compression methods for medical image databases Kosmas Karadimitriou, John M. Tyler 1997 1 1
29 29 l611 r141 Mining Generalized Association Rules Ramakrishnan Srikant, Rakesh Agrawal 1995 Multi-table joins through bitmapped join indices Patrick O'Neil, Goetz Graefe 1995 0 0
... ... ... ... ... ... ... ... ... ... ... ...
420 420 l834 r883 Estimating the Selectivity of XML Path Expressions for Internet Scale Applications Ashraf Aboulnaga, Jeffrey F. Naughton, Alaa R. Alameldeen 2001 Estimating the Selectivity of XML Path Expressions for Internet Scale Applications Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton 2001 1 1
421 421 l746 r301 Providing Database Migration Tools - A Practicioner's Approach Andreas Meier 1995 Providing Database Migration Tools - A Practicioner's Approach Andreas Meier 1995 1 1
422 422 l1332 r619 Workshop on Workflow Management in Scientific and Engineering Applications - Report Gottfried Vossen, Richard McClatchey 1997 Workshop on workflow management in scientific and engineering applications-report R. McClatchey, G. Vossen 1997 1 1
423 423 l942 r1473 Research in Databases and Data-Intensive Applications - Computer Science Department and FZI, Uni... Birgitta Knig-Ries, Peter C. Lockemann 1997 Research in databases and data-intensive applications: Computer Science Dept. and FIZ, Universit... Brigitta König-Ries, Peter C. Lockermann 1997 1 1
424 424 l806 r356 Tribeca: A Stream Database Manager for Network Traffic Analysis Mark Sullivan 1996 Type-safe relaxing of schema consistency rules for flexible modelling in OODBMS Eric Amiel, Marie-Jo Bellosta, Eric Dujardin, Eric Simon 1996 0 0
425 425 l794 r784 Spatial Data Management for Computer Aided Design Andreas Mller, Marco Ptke, Thomas Seidl, Hans-Peter Kriegel 2001 Dynamic content acceleration: a caching solution to enable scalable dynamic Web page generation Anindya Datta, Kaushik Dutta, Krithi Ramamritham, Helen Thomas, Debra VanderMeer 2001 0 0
426 426 l28 r1618 Storage Technology: RAID and Beyond Garth A. Gibson 1995 Tutorial on storage technology: RAID and beyond Garth A. Gibson 1995 1 1
427 427 l1183 r1409 Stephen Blott, Roger Weber, Hans-Jrg Schek A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... 1998 A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... Roger Weber, Hans-Jörg Schek, Stephen Blott 1998 1 0
428 428 l1122 r232 Interview with Jim Gray Marianne Winslett 2003 In-context peer-to-peer information filtering on the Web Aris M. Ouksel 2003 0 0
429 429 l1430 r1444 Condition Handling in SQL Persistent Stored Modules Jeff Richey 1995 Condition handling in SQL persistent stored modules Jeff Richey 1995 1 1
430 430 l1494 r1257 The Mariposa Distributed Database Management System Jeff Sidell 1996 Open issues in parallel query optimization Waqar Hasan, Daniela Florescu, Patrick Valduriez 1996 0 0
431 431 l1592 r439 Report on the 18th British National Conference on Databases (BNCOD) Carole A. Goble, Brian J. Read 2002 Contracting in the days of eBusiness W. Hümmer, W. Lehner, H. Wedekind 2002 0 0
432 432 l1015 r45 Database Systems - Breaking Out of the Box Abraham Silberschatz, Stanley B. Zdonik 1997 Dynamic Memory Adjustment for External Mergesort Weiye Zhang, Per-Åke Larson 1997 0 0
433 433 l1147 r1016 Xiaolei Qian Scientist's Called Upon to Take Actions 1996 Scientists called upon to take actions Xiaolei Qian 1996 1 0
434 434 l1756 r310 ARIES/CSA: A Method for Database Recovery in Client-Server Architectures C. Mohan, Inderpal Narang 1994 Enterprise information architectures-they're finally changing Wesley P. Melling 1994 0 0
435 435 l1044 r67 Digital Library Services in Mobile Computing Evaggelia Pitoura, Melliyal Annamalai, Bharat K. Bhargava 1995 Ordered shared locks for real-time databases Divyakant Agrawal, Amr El Abbadi, Richard Jeffers, Lijing Lin 1995 0 0
436 436 l412 r651 Phoenix: Making Applications Robust David B. Lomet, Roger S. Barga 1999 DataBlitz storage manager: main-memory database performance for critical applications J. Baulier, P. Bohannon, S. Gogate, C. Gupta, S. Haldar 1999 0 0
437 437 l796 r1808 Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl Alexandros Labrinidis, Nick Roussopoulos 2000 On wrapping query languages and efficient XML integration Vassilis Christophides, Sophie Cluet, Jérǒme Simèon 2000 0 0
438 438 l1570 r1468 Instance-based attribute identification in database integration Roger H. L. Chiang, Ee-Peng Lim, Chua Eng Huang Cecil 2003 Index-driven similarity search in metric spaces Gisli R. Hjaltason, Hanan Samet 2003 0 0
439 439 l1577 r688 Data Mining Using Two-Dimensional Optimized Accociation Rules: Scheme, Algorithms, and Visualiza... Shinichi Morishita, Yasuhiko Morimoto, Takeshi Tokuyama, Takeshi Fukuda 1996 Static detection of security flaws in object-oriented databases Keishi Tajima 1996 0 0
440 440 l617 r310 Fine-Grained Sharing in a Page Server OODBMS Michael J. Carey, Markos Zaharioudakis, Michael J. Franklin 1994 Enterprise information architectures-they're finally changing Wesley P. Melling 1994 0 0
441 441 l1304 r1178 Query Rewriting for Semistructured Data Vasilis Vassalos, Yannis Papakonstantinou 1999 The Aqua approximate query answering system Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy 1999 0 0
442 442 l727 r597 Design and Analysis of Parametric Query Optimization Algorithms Sumit Ganguly 1998 Incremental distance join algorithms for spatial databases Gísli R. Hjaltason, Hanan Samet 1998 0 0
443 443 l1205 r395 Proxy-Server Architectures for OLAP Panos Kalnis, Dimitris Papadias 2001 Proxy-server architectures for OLAP Panos Kalnis, Dimitris Papadias 2001 1 1
444 444 l915 r1532 Efficient k-NN search on vertically decomposed data Niels Nes, Martin L. Kersten, Nikos Mamoulis, Arjen P. de Vries 2002 Efficient k-NN search on vertically decomposed data Arjen P. de Vries, Nikos Mamoulis, Niels Nes, Martin Kersten 2002 1 1
445 445 l365 r53 50,000 Users on an Oracle8 Universal Server Database Ashok Josji, Tirthankar Lahiri, Amit Jasuja, Sumanta Chatterjee 1998 A workflow-based electronic marketplace on the Web Asuman Dogac, Ilker Durusoy, Sena Arpinar, Nesime Tatbul, Pinar Koksal, Ibrahim Cingil, Nazife D... 1998 0 0
446 446 l458 r767 Comparing Hierarchical Data in External Memory Sudarshan S. Chawathe 1999 Context-Based Prefetch for Implementing Objects on Relations Philip A. Bernstein, Shankar Pal, David Shutt 1999 0 0
447 447 l655 r412 The SDSS skyserver: public access to the sloan digital sky server data Tanu Malik, Jordan Raddick, Alexander S. Szalay, Peter Z. Kunszt, Jim Gray, Christopher Stoughto... 2002 Report on the ACM fourth international workshop on data warehousing and OLAP (DOLAP 2001) Joachim Hammer 2002 0 0
448 448 l123 r1493 Change-Centric Management of Versions in an XML Warehouse Laurent Mignet, Amlie Marian, Gregory Cobena, Serge Abiteboul 2001 A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi 2001 0 0
449 449 l590 r295 Skew handling techniques in sort-merge join Richard T. Snodgrass, Wei Li, Dengfeng Gao 2002 QURSED: querying and reporting semistructured data Yannis Papakonstantinou, Michalis Petropoulos, Vasilis Vassalos 2002 0 0

450 rows × 11 columns
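
Since `pred_label` now sits next to `label`, the two columns can be compared directly. The sketch below counts agreements and computes precision and recall in plain Python for the first 30 rows shown above (index 0 to 29); the values are copied from that output, and the helper name `confusion_counts` is an assumption made for this example.

```python
# label and pred_label for rows 0..29 of the output above
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
          0, 1, 0, 1, 1, 1, 1, 0, 1, 0,
          0, 0, 1, 1, 0, 1, 1, 1, 1, 0]
preds = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 1, 1, 0, 1, 1, 1, 1, 0]

def confusion_counts(gold, predicted):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    return tp, fp, fn

tp, fp, fn = confusion_counts(labels, preds)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn)         # prints: 13 0 2
print(precision, recall)
```

On these 30 rows the single remaining rule produces no false positives but misses two true matches (rows 11 and 18), so precision is 1.0 and recall is 13/15. These figures describe only the rows shown, not the full 450-row table.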

